World Health Organization

Context

Many past studies have examined the factors affecting life expectancy using demographic variables, income composition, and mortality rates. However, the effects of immunization and the Human Development Index were not taken into account. In addition, some past research applied multiple linear regression to a data set covering a single year for all countries. This motivates addressing both gaps by formulating a regression model, based on a mixed effects model and multiple linear regression, that uses data from 2000 to 2015 for all countries. Important immunizations such as Hepatitis B, Polio, and Diphtheria are also considered. In a nutshell, this study focuses on immunization factors, mortality factors, economic factors, social factors, and other health-related factors. Since the observations in this dataset are organized by country, it is easier for a country to determine which predicting factor contributes to a lower life expectancy. This helps suggest which area a country should prioritize in order to efficiently improve the life expectancy of its population.

Column Names: Meanings

  • Year: Year
  • Status: Developed or Developing status
  • Life expectancy: Life expectancy in years
  • Adult Mortality: Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population)
  • infant deaths: Number of Infant Deaths per 1000 population
  • Alcohol: Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)
  • percentage expenditure: Expenditure on health as a percentage of Gross Domestic Product per capita(%)
  • Hepatitis B: Hepatitis B (HepB) immunization coverage among 1-year-olds (%)
  • Measles: Measles - number of reported cases per 1000 population
  • BMI: Average Body Mass Index of entire population
  • Polio: Polio (Pol3) immunization coverage among 1-year-olds (%)
  • Total expenditure: General government expenditure on health as a percentage of total government expenditure (%)
  • Diphtheria: Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)
  • HIV/AIDS: Deaths per 1 000 live births HIV/AIDS (0-4 years)
  • GDP: Gross Domestic Product per capita (in USD)
  • Population: Population of the country
  • thinness 1-19 years: Prevalence of thinness among children and adolescents for Age 10 to 19 (%)
  • thinness 5-9 years: Prevalence of thinness among children for Age 5 to 9 (%)
  • income composition of resources: Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
  • schooling: Number of years of schooling (years)

ETL

Prior to running the models, the data was cleaned by removing nulls, adjusting for inconsistencies, and eliminating columns that were not required.

  1. Removing NaN values. This produced higher R^2 values and a lower number of variables considered significant. While the model appears to fit better, this is not the ideal scenario, since ~30% of the data was lost in the process.

  2. Filtering out percentage.expenditure greater than 100. This caused additional variables to be considered insignificant. The filter is applied because a population cannot spend more than 100% of its GDP on health care.

  3. Replacing NaN with means. This allows all data to be conserved.
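
The three cleaning strategies above can be sketched in R as follows. This is a minimal illustration on a made-up data frame: the object `toy`, its values, and the helper `impute_mean` are assumptions for demonstration and are not taken from the original analysis code.

```r
# Toy data standing in for the WHO table (values are invented for illustration).
toy <- data.frame(
  Life.expectancy        = c(65.0, 59.9, NA, 77.8, 74.5),
  percentage.expenditure = c(71.3, 73.5, 364.9, 12.1, NA),
  Total.expenditure      = c(8.16, NA, 6.00, 7.08, 5.88)
)

# 1. Drop every row that contains at least one missing value.
complete_rows <- na.omit(toy)

# 2. Keep only rows whose percentage expenditure is at most 100.
#    Rows where the value is missing are kept so that step 3 can impute them.
plausible <- toy[is.na(toy$percentage.expenditure) |
                   toy$percentage.expenditure <= 100, ]

# 3. Replace each missing value with the mean of its column.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}
imputed <- as.data.frame(lapply(plausible, impute_mean))

# Count the missing values that remain after imputation.
sum(is.na(imputed))
```

Dropping incomplete rows shrinks the toy table much more than mean imputation does, which mirrors the trade-off described in items 1 and 3.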

## [1] 0

1. Do the predicting factors that were chosen initially really affect life expectancy? Which predicting variables actually affect life expectancy?

After running multiple regression models (forward, backward, and stepwise selection), the following variables were determined to be statistically significant predictors of life expectancy; stepwise and backward selection gave the same results.

Variable                          Coefficient
StatusDeveloping                  -1.923054
Adult.Mortality                   -0.016917
infant.deaths                      0.086154
under.five.deaths                 -0.067124
Total.expenditure                  0.250440
Diphtheria                         0.028511
HIV.AIDS                          -1.020423
Income.composition.of.resources   27.870308

Adjusted R^2: 0.83
AIC: 992.018

While other variables may intuitively appear significant, the variables above are the predictors that have a significant impact according to the regression models. Interestingly enough, all models resulted in the same variables being returned.
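
As a rough sketch of how forward, backward, and stepwise selection can be run in R, the example below uses the base `step()` function. The built-in `mtcars` data set is only a stand-in, and the object names are assumptions; the original analysis may have used a different selection routine or criterion.

```r
# Built-in mtcars data is used only as a stand-in for the life expectancy table.
full_model <- lm(mpg ~ ., data = mtcars)
null_model <- lm(mpg ~ 1, data = mtcars)

# Backward elimination starts from the full model and drops terms.
backward <- step(full_model, direction = "backward", trace = 0)

# Forward selection starts from the intercept-only model and adds terms.
forward <- step(null_model, direction = "forward",
                scope = formula(full_model), trace = 0)

# Stepwise selection may both add and drop terms at each step.
stepwise <- step(full_model, direction = "both", trace = 0)

# Compare the selected models by AIC and adjusted R^2.
data.frame(
  model  = c("forward", "backward", "stepwise"),
  AIC    = c(AIC(forward), AIC(backward), AIC(stepwise)),
  adj_r2 = c(summary(forward)$adj.r.squared,
             summary(backward)$adj.r.squared,
             summary(stepwise)$adj.r.squared)
)
```

Because `step()` selects by AIC, a term can stay in the final model even when its individual p-value is above 0.05.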

## 
## Call:
## lm(formula = Life.expectancy ~ Status + Adult.Mortality + Total.expenditure + 
##     Diphtheria + HIV.AIDS + GDP + thinness.5.9.years + Income.composition.of.resources + 
##     Life.expectancy.category, data = df1_complete)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.1994  -1.8642   0.0611   1.6677   8.9195 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      6.170e+01  3.177e+00  19.423  < 2e-16 ***
## StatusDeveloping                -2.420e+00  8.076e-01  -2.997  0.00313 ** 
## Adult.Mortality                 -1.459e-02  3.633e-03  -4.017 8.77e-05 ***
## Total.expenditure                2.649e-01  9.709e-02   2.729  0.00702 ** 
## Diphtheria                       1.770e-02  1.160e-02   1.526  0.12872    
## HIV.AIDS                        -5.715e-01  2.472e-01  -2.312  0.02197 *  
## GDP                              2.888e-05  1.706e-05   1.693  0.09222 .  
## thinness.5.9.years              -1.700e-01  6.955e-02  -2.444  0.01553 *  
## Income.composition.of.resources  1.895e+01  3.274e+00   5.787 3.30e-08 ***
## Life.expectancy.categoryLow     -5.693e+00  1.008e+00  -5.649 6.53e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.286 on 173 degrees of freedom
## Multiple R-squared:   0.86,  Adjusted R-squared:  0.8527 
## F-statistic:   118 on 9 and 173 DF,  p-value: < 2.2e-16
## [1] 966.4646

## 
## Call:
## lm(formula = Life.expectancy ~ Status + Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + under.five.deaths + Polio + Total.expenditure + Diphtheria + 
##     HIV.AIDS + GDP + Population + thinness..1.19.years + thinness.5.9.years + 
##     Income.composition.of.resources + Schooling + Life.expectancy.category, 
##     data = df1_complete)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.1058  -1.6731   0.2159   1.5864   8.8874 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      6.250e+01  3.433e+00  18.206  < 2e-16 ***
## StatusDeveloping                -2.137e+00  8.647e-01  -2.472 0.014480 *  
## Adult.Mortality                 -1.494e-02  3.812e-03  -3.919 0.000131 ***
## infant.deaths                    3.532e-02  5.627e-02   0.628 0.531077    
## Alcohol                          1.012e-01  8.060e-02   1.256 0.210921    
## percentage.expenditure           5.173e-05  2.357e-04   0.219 0.826571    
## Hepatitis.B                     -1.146e-02  2.112e-02  -0.543 0.588053    
## Measles                         -2.369e-05  4.766e-05  -0.497 0.619749    
## BMI                             -5.531e-03  1.562e-02  -0.354 0.723766    
## under.five.deaths               -2.839e-02  3.896e-02  -0.729 0.467238    
## Polio                            2.512e-05  1.945e-02   0.001 0.998971    
## Total.expenditure                2.299e-01  1.041e-01   2.209 0.028587 *  
## Diphtheria                       2.606e-02  2.378e-02   1.096 0.274780    
## HIV.AIDS                        -5.667e-01  2.585e-01  -2.192 0.029778 *  
## GDP                              2.068e-05  3.627e-05   0.570 0.569358    
## Population                       2.374e-09  6.398e-09   0.371 0.711052    
## thinness..1.19.years             7.806e-03  2.346e-01   0.033 0.973500    
## thinness.5.9.years              -1.774e-01  2.326e-01  -0.763 0.446740    
## Income.composition.of.resources  1.880e+01  5.699e+00   3.299 0.001195 ** 
## Schooling                       -3.827e-02  2.313e-01  -0.165 0.868812    
## Life.expectancy.categoryLow     -5.430e+00  1.118e+00  -4.858 2.78e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.351 on 162 degrees of freedom
## Multiple R-squared:  0.8636, Adjusted R-squared:  0.8468 
## F-statistic:  51.3 on 20 and 162 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = Life.expectancy ~ Status + Adult.Mortality + Total.expenditure + 
##     Diphtheria + HIV.AIDS + GDP + thinness.5.9.years + Income.composition.of.resources + 
##     Life.expectancy.category, data = df1_complete)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.1994  -1.8642   0.0611   1.6677   8.9195 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      6.170e+01  3.177e+00  19.423  < 2e-16 ***
## StatusDeveloping                -2.420e+00  8.076e-01  -2.997  0.00313 ** 
## Adult.Mortality                 -1.459e-02  3.633e-03  -4.017 8.77e-05 ***
## Total.expenditure                2.649e-01  9.709e-02   2.729  0.00702 ** 
## Diphtheria                       1.770e-02  1.160e-02   1.526  0.12872    
## HIV.AIDS                        -5.715e-01  2.472e-01  -2.312  0.02197 *  
## GDP                              2.888e-05  1.706e-05   1.693  0.09222 .  
## thinness.5.9.years              -1.700e-01  6.955e-02  -2.444  0.01553 *  
## Income.composition.of.resources  1.895e+01  3.274e+00   5.787 3.30e-08 ***
## Life.expectancy.categoryLow     -5.693e+00  1.008e+00  -5.649 6.53e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.286 on 173 degrees of freedom
## Multiple R-squared:   0.86,  Adjusted R-squared:  0.8527 
## F-statistic:   118 on 9 and 173 DF,  p-value: < 2.2e-16
##          CV         AIC        AICc         BIC       AdjR2 
##  11.7114475 447.1331296 448.6769892 482.4374772   0.8526682

2. Should a country with a lower life expectancy (<65) increase its healthcare expenditure in order to improve its average lifespan?

We assume that the percentage expenditure column of the collected data set is not accurate. If percentage expenditure is defined as expenditure on health as a percentage of Gross Domestic Product per capita (%), it can never exceed 100%. It is also highly unlikely that any country would spend 100% of its GDP on healthcare.

We therefore ignore the percentage expenditure column in this analysis and use only Total expenditure to determine whether it has an impact on life expectancy.

From the EDA and a two-sample t-test, we see that the difference in healthcare expenditure between the two life expectancy groups is not statistically significant.

t = 1.9583, df = 181, p-value = 0.05173

95 percent confidence interval: -0.00710969, 1.88675914

Since 0 is one of the plausible values, the difference in Total expenditure between countries with life expectancy above 65 and below 65 is not statistically significant at the 5% level.
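
The link between the confidence interval and the p-value can be illustrated with a small sketch. The data below are simulated, and the group sizes (90 and 93), means, and standard deviation are assumptions chosen only so that the test has 181 degrees of freedom like the one reported here.

```r
set.seed(1)
# Simulated expenditure values for two groups (invented, not the WHO data).
toy <- data.frame(
  Total.expenditure = c(rnorm(90, mean = 6.4, sd = 2.5),
                        rnorm(93, mean = 5.5, sd = 2.5)),
  Life.expectancy.category = rep(c("High", "Low"), times = c(90, 93))
)

# Pooled-variance two-sample t-test, which has n - 2 degrees of freedom.
result <- t.test(Total.expenditure ~ Life.expectancy.category,
                 data = toy, var.equal = TRUE)

# The difference is significant at the 5% level only if the
# 95% confidence interval excludes 0.
ci <- result$conf.int
excludes_zero <- ci[1] > 0 || ci[2] < 0
c(p_value = result$p.value, excludes_zero = excludes_zero)
```

For a two-sided test, "the 95% interval contains 0" and "p-value above 0.05" are the same statement, so either can be used to report the result.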

## 
##  Two Sample t-test
## 
## data:  Total.expenditure by Life.expectancy.category
## t = 1.9583, df = 181, p-value = 0.05173
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.00710969  1.88675914
## sample estimates:
## mean in group High  mean in group Low 
##           6.411556           5.471732
## 
## Classification tree:
## tree(formula = Life.expectancy.category ~ Total.expenditure, 
##     data = df1_complete)
## Number of terminal nodes:  6 
## Residual mean deviance:  0.9451 = 167.3 / 177 
## Misclassification error rate: 0.1858 = 34 / 183

3. How do infant and adult mortality rates affect life expectancy?

From the EDA, the effect of infant deaths on life expectancy does not look significant. However, Adult.Mortality is negatively correlated with life expectancy.

The relationship between Adult.Mortality and life expectancy can be modeled by the regression equation below:

                      life expectancy = 80.64428 - 0.06125 (Adult Mortality)

As Adult.Mortality increases by one unit, life expectancy is expected to decrease by 0.06125 years. With an Adult.Mortality of zero, life expectancy is expected to be 80.64428 years. Adjusted R^2: 0.5732

When Adult.Mortality and infant deaths were used together, the model did not meet the linear regression assumptions.
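
To make the fitted line concrete, the sketch below evaluates the reported equation at a few adult mortality rates. The coefficients are copied from the model output in this section; the helper name `predict_life_expectancy` and the chosen rates are illustrative assumptions.

```r
# Coefficients copied from the fitted model reported in this section.
intercept <- 80.64428
slope     <- -0.06125

# Hypothetical helper: predicted life expectancy for a given adult mortality rate.
predict_life_expectancy <- function(adult_mortality) {
  intercept + slope * adult_mortality
}

# Illustrative adult mortality rates (deaths per 1000 population).
rates <- c(0, 100, 200, 300)
data.frame(Adult.Mortality = rates,
           predicted = predict_life_expectancy(rates))
```

Each additional 100 deaths per 1000 population lowers the predicted life expectancy by about 6.1 years.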

## 
## Call:
## lm(formula = Life.expectancy ~ Adult.Mortality, data = df1_complete)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.9654  -2.5457   0.8639   3.2843  13.1335 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     80.64428    0.71342  113.04   <2e-16 ***
## Adult.Mortality -0.06125    0.00391  -15.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.593 on 181 degrees of freedom
## Multiple R-squared:  0.5755, Adjusted R-squared:  0.5732 
## F-statistic: 245.4 on 1 and 181 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = Life.expectancy ~ Adult.Mortality + infant.deaths, 
##     data = df1_complete)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.4677  -2.6017   0.6429   3.1007  12.9854 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     80.675651   0.706482 114.193   <2e-16 ***
## Adult.Mortality -0.059758   0.003933 -15.194   <2e-16 ***
## infant.deaths   -0.010334   0.004791  -2.157   0.0323 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.537 on 180 degrees of freedom
## Multiple R-squared:  0.5862, Adjusted R-squared:  0.5816 
## F-statistic: 127.5 on 2 and 180 DF,  p-value: < 2.2e-16

4. Does life expectancy have a positive or negative correlation with eating habits, lifestyle, exercise, smoking, drinking alcohol, etc.?

We assume that BMI is related to lifestyle, eating habits, and exercise. Since we see a positive correlation between BMI and life expectancy, we assume that good eating habits, ample exercise, and a healthy lifestyle lead to a higher life expectancy.

The summary statistics show high significance, with a p-value below 0.001. The relationship can be modeled by the regression equation below:

                      life expectancy = 63.57 + 0.19423 (BMI)

As BMI increases by one unit, life expectancy is expected to increase by 0.19423 years. With a BMI of zero, life expectancy is expected to be 63.57 years (which practically does not make sense). Adjusted R^2: 0.2226

Note: The effect of alcohol is covered in question 6.

## 
## Call:
## lm(formula = Life.expectancy ~ BMI, data = df1_complete)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.0898  -4.6841   0.3523   4.3841  24.0927 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 63.56713    1.22767  51.779  < 2e-16 ***
## BMI          0.19423    0.02665   7.288 9.42e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.548 on 181 degrees of freedom
## Multiple R-squared:  0.2269, Adjusted R-squared:  0.2226 
## F-statistic: 53.11 on 1 and 181 DF,  p-value: 9.42e-12

5. What is the impact of schooling on the lifespan of humans?

Life expectancy and schooling have a positive linear relationship. Despite not being significant in the full model, schooling is a significant indicator when it is modeled in a simple linear regression.

The summary statistics show high significance, with a p-value below 0.001. The relationship can be modeled by the regression below:

                      life expectancy = 41.42 + 2.34 (schooling)

As schooling increases by one year, life expectancy increases by 2.34 years, with no schooling corresponding to a life expectancy of 41.42 years. Adjusted R^2: 0.5949

## 
## Call:
## lm(formula = Life.expectancy ~ Schooling, data = df1_complete)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.7366  -2.9955   0.4148   3.7260  11.2971 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  41.4171     1.8826   22.00   <2e-16 ***
## Schooling     2.3371     0.1427   16.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.449 on 181 degrees of freedom
## Multiple R-squared:  0.5971, Adjusted R-squared:  0.5949 
## F-statistic: 268.2 on 1 and 181 DF,  p-value: < 2.2e-16

6. Does life expectancy have a positive or negative relationship with drinking alcohol?

Life expectancy increases slightly with alcohol consumption.

                      life expectancy = 68.03 + 1.07 (Alcohol)

As alcohol consumption increases by one litre per capita, life expectancy increases by 1.07 years, starting at 68.03 years if no alcohol is consumed.

This is counterintuitive, since alcohol is not considered healthy and many studies suggest that it could shorten life spans. With this in mind, additional studies may need to be conducted.

In addition, the model does not meet the required assumptions for linear regression. Specifically, the model lacks constant variance, so its results are not reliable.
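
One way to check the constant variance assumption is sketched below. The data are simulated so that the noise grows with the predictor. The hand-computed statistic follows the studentized (Koenker) form of the Breusch-Pagan test; this is an assumption about a reasonable check, not necessarily the diagnostic used in the original analysis.

```r
set.seed(42)
# Simulated data whose noise grows with x (invented to show heteroscedasticity).
x <- runif(200, min = 0, max = 15)
y <- 68 + 1.1 * x + rnorm(200, sd = 0.5 + 0.6 * x)
fit <- lm(y ~ x)

# Visual check: a funnel shape in residuals vs fitted values
# suggests non-constant variance.
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Breusch-Pagan style check done by hand: regress squared residuals on x
# and compare n * R^2 with a chi-squared distribution (1 predictor, df = 1).
aux <- lm(resid(fit)^2 ~ x)
bp_statistic <- length(x) * summary(aux)$r.squared
bp_p_value   <- pchisq(bp_statistic, df = 1, lower.tail = FALSE)
bp_p_value  # a small value is evidence against constant variance
```

When this check fails, the coefficient estimates can still describe the trend, but the reported standard errors and p-values should be treated with caution.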

## 
## Call:
## lm(formula = Life.expectancy ~ Alcohol, data = df1_complete)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.2658  -4.4254   0.7636   5.5977  15.4636 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  68.0257     0.6916  98.362  < 2e-16 ***
## Alcohol       1.0732     0.1312   8.179 4.89e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.335 on 181 degrees of freedom
## Multiple R-squared:  0.2699, Adjusted R-squared:  0.2658 
## F-statistic:  66.9 on 1 and 181 DF,  p-value: 4.887e-14

7. Do densely populated countries tend to have lower life expectancy?

Initially, there is a significant outlier that skews the data dramatically. Once it was removed, the fitted slope changed from -3.475e-09 to -1.128e-08. Considering that significantly high populations are possible, it was decided to keep the data point in.

                      life expectancy = 71.61 - 3.475e-9 (Population)    (outlier kept)
                      life expectancy = 71.71 - 1.128e-8 (Population)    (outlier removed)

In both fits the slope is negative but very small, and it is not statistically significant (p = 0.59 with the outlier, p = 0.618 without it), so the data do not show that more populous countries have lower life expectancy. The intercept has no logical interpretation on its own, since a country with no population would have no life expectancy.

We notice again that the model fails constant variance; hence, the model cannot be utilized.

## 
## Call:
## lm(formula = Life.expectancy ~ Population, data = df1_complete)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.507  -5.987   1.991   5.313  17.409 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.161e+01  6.484e-01  110.44   <2e-16 ***
## Population  -3.475e-09  6.440e-09   -0.54     0.59    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.578 on 181 degrees of freedom
## Multiple R-squared:  0.001606,   Adjusted R-squared:  -0.00391 
## F-statistic: 0.2911 on 1 and 181 DF,  p-value: 0.5902

## 
## Call:
## lm(formula = Life.expectancy ~ Population, data = newdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.606  -6.029   2.172   5.398  17.347 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.171e+01  7.120e-01   100.7   <2e-16 ***
## Population  -1.128e-08  2.257e-08    -0.5    0.618    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.598 on 180 degrees of freedom
## Multiple R-squared:  0.001387,   Adjusted R-squared:  -0.004161 
## F-statistic:  0.25 on 1 and 180 DF,  p-value: 0.6177

8. What is the impact of immunization coverage on life expectancy?

Reviewing the graphs, there does not appear to be a significant relationship between life expectancy and immunization.

  1. Linear regression is not applicable because the data does not meet the required assumptions.
     • Fails constant variance
  2. Log transformations also did not satisfy the required assumptions for regression.
     • Fails constant variance
  3. Polynomial regression met the required assumptions, with a low adjusted R^2 (0.15) and a high AIC (1280):

                      Life.expectancy = 68.88 - 0.27 Hepatitis.B + 0.003 I(Hepatitis.B^2)
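
To see what the quadratic fit implies, the sketch below evaluates it using the coefficients printed in the model output for this question. The helper name `fitted_life` and the coverage values are illustrative assumptions.

```r
# Coefficients copied from the quadratic fit reported in this section.
b0 <- 68.8849360
b1 <- -0.2710250
b2 <- 0.0033928

# Hypothetical helper: fitted life expectancy at a given Hepatitis B coverage (%).
fitted_life <- function(coverage) b0 + b1 * coverage + b2 * coverage^2

# A quadratic b0 + b1*x + b2*x^2 with b2 > 0 has its minimum at x = -b1 / (2 * b2).
turning_point <- -b1 / (2 * b2)

data.frame(coverage = c(10, turning_point, 99),
           fitted   = fitted_life(c(10, turning_point, 99)))
```

Under these coefficients the fitted curve falls until coverage reaches roughly 40% and rises afterwards, so the positive association with life expectancy appears only at higher coverage levels.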
## 
## Call:
## lm(formula = Life.expectancy ~ ., data = df_immunizations)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.1606  -5.6656   0.6302   4.6017  24.9821 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 56.68785    2.54659  22.260  < 2e-16 ***
## Hepatitis.B -0.01012    0.04671  -0.217  0.82871    
## Polio        0.11826    0.04232   2.794  0.00577 ** 
## Diphtheria   0.06744    0.05293   1.274  0.20425    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.83 on 179 degrees of freedom
## Multiple R-squared:  0.1772, Adjusted R-squared:  0.1634 
## F-statistic: 12.85 on 3 and 179 DF,  p-value: 1.212e-07

## 
## Call:
## lm(formula = log.Life.expectancy ~ log.Hepatitis.B, data = df_immunizations)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39440 -0.07385  0.02842  0.07380  0.23821 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.05582    0.06302  64.358   <2e-16 ***
## log.Hepatitis.B  0.04795    0.01446   3.317   0.0011 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1211 on 181 degrees of freedom
## Multiple R-squared:  0.0573, Adjusted R-squared:  0.05209 
## F-statistic:    11 on 1 and 181 DF,  p-value: 0.0011

## 
## Call:
## lm(formula = Life.expectancy ~ Hepatitis.B + I(Hepatitis.B^2), 
##     data = df_immunizations)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.6631  -5.3127   0.0933   4.5659  19.2030 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      68.8849360  2.9985672  22.973  < 2e-16 ***
## Hepatitis.B      -0.2710250  0.1123556  -2.412 0.016861 *  
## I(Hepatitis.B^2)  0.0033928  0.0009535   3.558 0.000477 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.902 on 180 degrees of freedom
## Multiple R-squared:  0.1574, Adjusted R-squared:  0.148 
## F-statistic: 16.81 on 2 and 180 DF,  p-value: 2.027e-07
## [1] 1280.866
## [1] 1293.704

Objective 1

Display the ability to build regression models using the skills and discussions from Unit 1 and 2 with the purpose of identifying key relationships, interpreting those relationships, and making good predictions.

Reminder, key here is to tell a good story.

Build Model 1

  • Identify key relationships
  • Ensure interpretability

Unit 2 Objectives:

  • bias vs. variance
  • complexity
  • LASSO/LARS, CV (cross validation)

  1. Perform regression analysis
  • LASSO: Using the lasso regression method, Income.composition.of.resources was determined to be significant. While other variables made the model (Total.expenditure, HIV.AIDS, and BMI), their coefficients are extremely small and are not considered significant.

                Life expectancy = 37.72 + 49.12 (income)

    R^2: 0.7317

    CV: 9.896252

Notice that there is a drop in R^2 compared with the linear model, but there is significantly less complexity.
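
A cross-validated LASSO fit of the kind described above can be sketched as follows. This assumes the `glmnet` package is installed and uses the built-in `mtcars` data as a stand-in; the seed, the number of folds, and the object names are assumptions.

```r
# Assumes the glmnet package is installed; mtcars stands in for the WHO data.
if (requireNamespace("glmnet", quietly = TRUE)) {
  set.seed(123)
  # Predictor matrix without the intercept column.
  x <- model.matrix(mpg ~ ., data = mtcars)[, -1]
  y <- mtcars$mpg

  # alpha = 1 selects the LASSO penalty; lambda is chosen by 5-fold cross validation.
  cv_fit <- glmnet::cv.glmnet(x, y, alpha = 1, nfolds = 5)

  # Coefficients at the lambda with the lowest cross-validated error.
  # Entries printed as "." have been shrunk exactly to zero.
  print(coef(cv_fit, s = "lambda.min"))

  # Cross-validated mean squared error at that lambda.
  print(min(cv_fit$cvm))
}
```

The penalty trades a small amount of fit for a sparser model, which is the bias versus variance trade-off listed in the Unit 2 objectives.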

## [1] 9.782965
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                                             1
## (Intercept)                     62.1431290847
## StatusDeveloping                -1.9342221405
## Adult.Mortality                 -0.0188318537
## infant.deaths                    .           
## Alcohol                          0.0024896480
## percentage.expenditure           .           
## Hepatitis.B                      .           
## Measles                          .           
## BMI                              .           
## under.five.deaths               -0.0006208113
## Polio                            .           
## Total.expenditure                0.4010933313
## Diphtheria                       0.0031529074
## HIV.AIDS                         .           
## GDP                              .           
## Population                       .           
## thinness..1.19.years             .           
## thinness.5.9.years              -0.1168038289
## Income.composition.of.resources 18.5336459933
## Schooling                        .           
## Life.expectancy.categoryLow     -7.2540778170

##       RMSE   Rsquare
## 1 3.127773 0.8540002
## 
## Call:
## lm(formula = Life.expectancy ~ Income.composition.of.resources, 
##     data = df1_complete)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.7366  -2.0664   0.0655   2.3525  10.4634 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       37.723      1.551   24.32   <2e-16 ***
## Income.composition.of.resources   49.119      2.202   22.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.434 on 181 degrees of freedom
## Multiple R-squared:  0.7332, Adjusted R-squared:  0.7317 
## F-statistic: 497.4 on 1 and 181 DF,  p-value: < 2.2e-16

  2. Report predictive ability
     a. Test/train set
     b. CV data
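
A train/test evaluation along these lines can be sketched as below. The data are simulated, and the 80/20 split, the seed, and the name `testData` are assumptions, although an 80% split of 183 rows would be consistent with the 146 training rows implied by the output that follows.

```r
set.seed(2024)
# Simulated stand-in for the income index and life expectancy (invented values).
n <- 183
toy <- data.frame(income = runif(n, min = 0.3, max = 0.95))
toy$life <- 37 + 50 * toy$income + rnorm(n, sd = 4)

# Hold out roughly 20% of the rows for testing.
train_idx    <- sample(seq_len(n), size = floor(0.8 * n))
trainingData <- toy[train_idx, ]
testData     <- toy[-train_idx, ]

fit <- lm(life ~ income, data = trainingData)

# Predictions on the held-out rows, with 95% confidence bounds for the mean response.
pred <- predict(fit, newdata = testData, interval = "confidence")

# Out-of-sample error summaries.
rmse <- sqrt(mean((testData$life - pred[, "fit"])^2))
r2   <- cor(testData$life, pred[, "fit"])^2
c(RMSE = rmse, Rsquare = r2)
```

Using `interval = "prediction"` instead would give wider bounds that describe individual countries rather than the mean response.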

##                                    2.5 %   97.5 %
## (Intercept)                     33.81911 40.36032
## Income.composition.of.resources 45.63222 54.82424
## 
## Call:
## lm(formula = Life.expectancy ~ Income.composition.of.resources, 
##     data = trainingData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.3665  -1.9079   0.1418   2.1354  10.3335 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       37.090      1.655   22.41   <2e-16 ***
## Income.composition.of.resources   50.228      2.325   21.60   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.237 on 144 degrees of freedom
## Multiple R-squared:  0.7642, Adjusted R-squared:  0.7625 
## F-statistic: 466.6 on 1 and 144 DF,  p-value: < 2.2e-16
## [1] 839.9218
## [1] 848.8726
##          CV         AIC        AICc         BIC       AdjR2 
##  18.1672688 425.5917578 425.7607719 434.5425777   0.7625353
##          fit      lwr      upr
## 6   78.52801 77.61430 79.44173
## 17  72.50062 71.80614 73.19511
## 22  72.09880 71.40566 72.79193
## 28  71.66649 70.97263 72.36035
## 29  69.38647 68.65264 70.12031
## 34  56.67873 55.11308 58.24438
## 52  71.04401 70.34516 71.74285
## 53  66.32255 65.45521 67.18990
## 56  58.93900 57.55577 60.32223
## 62  75.21295 74.46083 75.96507
## 67  67.92986 67.14211 68.71760
## 69  58.13535 56.68801 59.58269
## 79  82.79741 81.59214 84.00268
## 82  73.60565 72.89755 74.31374
## 84  74.10793 73.38901 74.82685
## 89  70.03944 69.32301 70.75587
## 90  65.87050 64.97767 66.76332
## 92  75.41386 74.65442 76.17331
## 96  79.33166 78.36842 80.29491
## 101 71.89789 71.20470 72.59108
## 113 68.83396 68.08200 69.58592
## 114 64.76548 63.80570 65.72526
## 117 69.03488 68.28986 69.77990
## 118 54.41846 52.66497 56.17195
## 119 63.25863 62.19879 64.31847
## 120 84.55540 83.21533 85.89547
## 124 62.75635 61.66132 63.85137
## 125 71.64674 70.95280 72.34069
## 143 75.56455 74.79936 76.32973
## 153 75.26318 74.50926 76.01709
## 155 73.35450 72.65075 74.05826
## 160 68.33168 67.56073 69.10263
## 172 78.87961 77.94462 79.81461
## 176 76.82025 75.99976 77.64075
## 177 71.74720 71.05365 72.44075
## 178 67.02575 66.19549 67.85600
## 183 62.10338 60.96147 63.24529

Comparing the models

## 
## Call:
## lm(formula = Life.expectancy ~ Income.composition.of.resources, 
##     data = df1_complete)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.7366  -2.0664   0.0655   2.3525  10.4634 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       37.723      1.551   24.32   <2e-16 ***
## Income.composition.of.resources   49.119      2.202   22.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.434 on 181 degrees of freedom
## Multiple R-squared:  0.7332, Adjusted R-squared:  0.7317 
## F-statistic: 497.4 on 1 and 181 DF,  p-value: < 2.2e-16
  1. Interpret the coefficients

     Life expectancy has a linear relationship with income composition of resources: as the income index increases from 0 to 1, life expectancy increases by 50.2 years. Since the index ranges from 0 to 1, the maximum predicted life expectancy is 87.3 years. This is not fully realistic, considering that many people live past this age. In addition, with an index of 0, the expected life expectancy is 37.1 years. While this may also not be applicable in practice, it could pertain to those who are unemployed without any income source.

  2. Confidence intervals

##                                    2.5 %   97.5 %
## (Intercept)                     33.81911 40.36032
## Income.composition.of.resources 45.63222 54.82424
  3. Practical and statistical significance: The income index is the most significant predictor of life expectancy, explaining more than 70% of the variance in the data.
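
The interpretation of the coefficients can be checked with a little arithmetic. The sketch below uses the intercept and slope from the training-set fit printed earlier in this section; the variable names are illustrative assumptions.

```r
# Coefficients copied from the training-set fit reported in this section.
intercept <- 37.090
slope     <- 50.228

# The income index runs from 0 to 1, so these are the extremes the line can predict.
predicted_range <- intercept + slope * c(0, 1)
predicted_range

# A 0.1 increase in the index corresponds to this many additional years.
slope * 0.1
```

Because the index cannot change by a full unit for any real country, the per-0.1 figure of about 5 years is the more practical reading of the slope.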

Model 2

- Produce the best predictions possible
- Interpretation is no longer required, hence complexity is no longer an issue
  1. Feature selection to avoid overfitting

A. Linear Regression

- model a: linear regression
  life expectancy = 36.55 + 50.73 (income)
  Adjusted R^2 = 0.79
- model b: linear regression + adult mortality
  life expectancy = 48.5 + 38.77 (income) - 0.025 (adult mortality)
  Adjusted R^2 = 0.84
- model c: linear regression + adult mortality + HIV.AIDS
  life expectancy = 49.8 + 36.05 (income) - 0.016 (adult mortality) - 0.95 (HIV/AIDS)
  Adjusted R^2 = 0.85

B. Interaction Terms

- model d: linear regression + adult mortality + HIV.AIDS
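
Growing a model one predictor at a time and then adding an interaction term can be sketched as follows. The built-in `mtcars` data stands in for the WHO table, and the choice of `wt` and `hp` as predictors is purely an assumption for illustration.

```r
# mtcars stands in for the WHO data; wt and hp play the roles of two predictors.
model_a <- lm(mpg ~ wt, data = mtcars)        # single predictor
model_b <- lm(mpg ~ wt + hp, data = mtcars)   # add a second predictor
model_d <- lm(mpg ~ wt * hp, data = mtcars)   # main effects plus wt:hp interaction

# Collect adjusted R^2 and AIC side by side.
models <- list(a = model_a, b = model_b, d = model_d)
comparison <- data.frame(
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC    = sapply(models, AIC)
)
comparison

# Nested models can also be compared with an F-test.
anova(model_a, model_b, model_d)
```

Adjusted R^2 and AIC both penalize extra terms, so an added predictor or interaction is worth keeping only if it improves these measures and not just the raw R^2.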

## 
## Call:
## lm(formula = Life.expectancy ~ Income.composition.of.resources, 
##     data = df1_complete)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.7366  -2.0664   0.0655   2.3525  10.4634 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       37.723      1.551   24.32   <2e-16 ***
## Income.composition.of.resources   49.119      2.202   22.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.434 on 181 degrees of freedom
## Multiple R-squared:  0.7332, Adjusted R-squared:  0.7317 
## F-statistic: 497.4 on 1 and 181 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = Life.expectancy ~ Income.composition.of.resources + 
##     Adult.Mortality, data = df1_complete)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.3882  -1.6648   0.1117   1.9530  10.2110 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     50.357043   2.307517  21.823  < 2e-16 ***
## Income.composition.of.resources 36.398916   2.705873  13.452  < 2e-16 ***
## Adult.Mortality                 -0.026076   0.003809  -6.847 1.15e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.961 on 180 degrees of freedom
## Multiple R-squared:  0.7883, Adjusted R-squared:  0.786 
## F-statistic: 335.2 on 2 and 180 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = Life.expectancy ~ Income.composition.of.resources + 
##     Adult.Mortality + Total.expenditure, data = df.expenditure)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.4690  -1.6843   0.4308   2.5825   8.6144 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     48.874569   3.752761  13.024  < 2e-16 ***
## Income.composition.of.resources 34.237635   4.879284   7.017 5.66e-10 ***
## Adult.Mortality                 -0.022318   0.005845  -3.818 0.000258 ***
## Total.expenditure                0.327685   0.185466   1.767 0.080934 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.718 on 83 degrees of freedom
## Multiple R-squared:  0.6884, Adjusted R-squared:  0.6771 
## F-statistic: 61.12 on 3 and 83 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = Life.expectancy ~ Income.composition.of.resources + 
##     Adult.Mortality + HIV.AIDS, data = df1_complete)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.206  -1.834  -0.020   2.010  10.284 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     50.554312   2.216126  22.812  < 2e-16 ***
## Income.composition.of.resources 35.477611   2.608106  13.603  < 2e-16 ***
## Adult.Mortality                 -0.018297   0.004135  -4.425 1.67e-05 ***
## HIV.AIDS                        -1.055254   0.261798  -4.031 8.21e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.803 on 179 degrees of freedom
## Multiple R-squared:  0.8059, Adjusted R-squared:  0.8027 
## F-statistic: 247.8 on 3 and 179 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = Life.expectancy ~ Income.composition.of.resources + 
##     Adult.Mortality + HIV.AIDS + (Adult.Mortality * HIV.AIDS) + 
##     (Income.composition.of.resources * Adult.Mortality) + (Income.composition.of.resources * 
##     HIV.AIDS), data = df1_complete)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.3664  -1.8844  -0.1826   1.8580   9.9718 
## 
## Coefficients:
##                                                  Estimate Std. Error t value
## (Intercept)                                     52.998853   3.048212  17.387
## Income.composition.of.resources                 33.733828   3.890978   8.670
## Adult.Mortality                                 -0.032444   0.015087  -2.151
## HIV.AIDS                                         2.045694   1.958163   1.045
## Adult.Mortality:HIV.AIDS                         0.004782   0.001708   2.799
## Income.composition.of.resources:Adult.Mortality  0.012899   0.023369   0.552
## Income.composition.of.resources:HIV.AIDS        -9.070358   3.124681  -2.903
##                                                 Pr(>|t|)    
## (Intercept)                                      < 2e-16 ***
## Income.composition.of.resources                 2.78e-15 ***
## Adult.Mortality                                  0.03288 *  
## HIV.AIDS                                         0.29759    
## Adult.Mortality:HIV.AIDS                         0.00569 ** 
## Income.composition.of.resources:Adult.Mortality  0.58168    
## Income.composition.of.resources:HIV.AIDS         0.00417 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.533 on 176 degrees of freedom
## Multiple R-squared:  0.8353, Adjusted R-squared:  0.8297 
## F-statistic: 148.7 on 6 and 176 DF,  p-value: < 2.2e-16
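
The adjusted R-squared values in the summaries above penalize each extra predictor, which is why they are used here for feature selection. A minimal Python sketch of the standard formula follows, applied to the four models fitted on `df1_complete`. The function name and the short labels are illustrative, and the inputs are the rounded values printed above, so the results match the printed figures only up to rounding.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors (intercept not counted)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# (label, multiple R^2, residual degrees of freedom, number of predictors)
fits = [
    ("income", 0.7332, 181, 1),
    ("income + adult mortality", 0.7883, 180, 2),
    ("income + adult mortality + HIV.AIDS", 0.8059, 179, 3),
    ("main effects + two-way interactions", 0.8353, 176, 6),
]

for label, r2, resid_df, p in fits:
    # Observations = residual df + number of estimated coefficients.
    n = resid_df + p + 1
    print(f"{label}: adjusted R^2 = {adjusted_r2(r2, n, p):.4f}")
```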
  1. Create the model
## 
## Call:
## lm(formula = Life.expectancy ~ Income.composition.of.resources + 
##     Adult.Mortality + (Income.composition.of.resources * HIV.AIDS), 
##     data = trainingData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.2720  -1.8066  -0.1358   1.5006   9.7367 
## 
## Coefficients:
##                                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                              48.328934   2.339859  20.655  < 2e-16
## Income.composition.of.resources          39.461755   2.795040  14.118  < 2e-16
## Adult.Mortality                          -0.022092   0.004253  -5.195 7.03e-07
## HIV.AIDS                                  4.341832   1.437251   3.021 0.002993
## Income.composition.of.resources:HIV.AIDS -9.631094   2.763085  -3.486 0.000655
##                                             
## (Intercept)                              ***
## Income.composition.of.resources          ***
## Adult.Mortality                          ***
## HIV.AIDS                                 ** 
## Income.composition.of.resources:HIV.AIDS ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.504 on 141 degrees of freedom
## Multiple R-squared:  0.8421, Adjusted R-squared:  0.8376 
## F-statistic: 187.9 on 4 and 141 DF,  p-value: < 2.2e-16
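
With an interaction term in the model, the HIV.AIDS coefficient is no longer a single slope: its effect on predicted life expectancy depends on the income index. The Python sketch below only reads the fitted coefficients printed above. The helper names are hypothetical, and the break-even index is an algebraic consequence of the estimates rather than a result reported by the analysis.

```python
# Coefficients copied from the training-data summary above.
coef = {
    "intercept": 48.328934,
    "income": 39.461755,
    "adult_mortality": -0.022092,
    "hiv": 4.341832,
    "income_x_hiv": -9.631094,
}


def predict(income, adult_mortality, hiv):
    """Predicted life expectancy from the fitted interaction model."""
    return (
        coef["intercept"]
        + coef["income"] * income
        + coef["adult_mortality"] * adult_mortality
        + coef["hiv"] * hiv
        + coef["income_x_hiv"] * income * hiv
    )


def hiv_slope(income):
    """Change in prediction per unit of HIV.AIDS at a given income index."""
    return coef["hiv"] + coef["income_x_hiv"] * income


# Income index at which the HIV.AIDS slope changes sign.
break_even = -coef["hiv"] / coef["income_x_hiv"]

print(f"HIV.AIDS slope at income 0.8: {hiv_slope(0.8):.2f}")
print(f"slope is zero at income index {break_even:.3f}")
```

The positive HIV.AIDS main effect is the slope at an income index of 0, so it should be read together with the interaction term rather than on its own.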

  1. Compare model 1 vs. model 2
##                                                 2.5 %      97.5 %
## (Intercept)                               43.70319355 52.95467477
## Income.composition.of.resources           33.93615217 44.98735854
## Adult.Mortality                           -0.03049973 -0.01368488
## HIV.AIDS                                   1.50048435  7.18317931
## Income.composition.of.resources:HIV.AIDS -15.09352371 -4.16866375
##   Models       CV      AIC      BIC     AdjR2 Accuracy
## 1 model1 19.84151 549.0920 558.7205 0.7317107     0.76
## 2 model2 12.74281 373.0652 390.9669 0.8375767     0.85
  1. Comment on the differences of the models and whether model 2 brings any benefit

We notice that model 2 (the more complex of the two models) has a lower CV PRESS (12.74 vs. 19.84), lower AIC and BIC, and a higher adjusted R^2 (0.84 vs. 0.73). While model 1 is simpler to interpret, model 2 has greater predictive power.
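
How the CV value in the table was computed is not shown. As an illustration only, the Python sketch below assumes leave-one-out cross-validation of a one-predictor regression and uses made-up data. It computes the cross-validated mean squared error twice: once by refitting the model n times, and once with the PRESS shortcut, which divides each residual by one minus its leverage. All names and data values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def loocv_mse(xs, ys):
    """Leave-one-out CV error by refitting without each observation."""
    total = 0.0
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - (a + b * xs[i])) ** 2
    return total / len(xs)


def press_mse(xs, ys):
    """Same quantity from a single fit, using leverages h_ii."""
    n = len(xs)
    a, b = fit_line(xs, ys)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    total = 0.0
    for x, y in zip(xs, ys):
        leverage = 1 / n + (x - mean_x) ** 2 / sxx
        total += ((y - (a + b * x)) / (1 - leverage)) ** 2
    return total / n


# Made-up (income index, life expectancy) pairs, for illustration only.
xs = [0.35, 0.42, 0.50, 0.58, 0.66, 0.71, 0.79, 0.88]
ys = [52.1, 58.4, 61.0, 66.3, 70.2, 71.9, 76.5, 80.8]

print(loocv_mse(xs, ys), press_mse(xs, ys))
```

Both functions return the same value up to floating-point error. A lower value indicates better out-of-sample prediction, which is the sense in which model 2 is preferred.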

Objective 2

- Nonparametric technique
- kNN or regression trees (select one)
  1. Model
## 
## Call:
##  randomForest(formula = Life.expectancy ~ ., data = trainingData) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 6
## 
##           Mean of squared residuals: 7.513744
##                     % Var explained: 89.99

## 
## Call:
##  randomForest(formula = Life.expectancy ~ ., data = new.train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 9.186914
##                     % Var explained: 87.76
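
The two random forest summaries above can be cross-checked against each other. The sketch assumes the usual definition of "% Var explained" for a regression forest, 100 * (1 - MSE / variance of the response), where MSE is the printed mean of squared residuals. Under that assumption each fit implies a response variance, and the two values should agree if both forests were trained on the same life expectancy values. The function name is illustrative and the snippet is Python rather than the R used in the report.

```python
# (mean of squared residuals, % Var explained) from the two fits above.
fits = {
    "trainingData": (7.513744, 89.99),
    "new.train": (9.186914, 87.76),
}


def implied_variance(mse, pct_explained):
    """Response variance implied by pct = 100 * (1 - mse / variance)."""
    return mse / (1 - pct_explained / 100)


variances = {name: implied_variance(*values) for name, values in fits.items()}

for name, variance in variances.items():
    print(f"{name}: implied variance of Life.expectancy = {variance:.2f}")
```

Both fits imply a variance of about 75, so the drop from 89.99% to 87.76% reflects a higher prediction error rather than a different response.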